Exploratory Analysis

Instructions

Step 0:
A few words of caution:
1) Read all the way through the instructions.
2) Models must be built using Python.
3) No additional data may be added or used.
4) Not all data must be used to build an adequate model, but making use of complex variables will help us identify high-performance candidates.
5) The predictions returned should be the class probabilities for belonging to the positive class, not the class itself (i.e. a decimal value, not just 1 or 0). Be sure to output a prediction for EACH of the 10,000 rows in the test dataset.

Step 1:
Clean and prepare your data: There are several entries where values have been deleted to simulate dirty data. Please clean the data with whatever method(s) you believe is best/most suitable. Note that some of the missing values are truly blank (unknown answers).

Step 2:
Build your models: Please build two distinctly different machine learning/statistical models to predict the value for y. When writing the code associated with each model, please have the first part produce and save the model, followed by a second part that loads and applies the model.

Step 3:
Create predictions on the test dataset using both of your trained models. The predictions should be the class probabilities for belonging to the positive class (labeled ë1í). Be sure to output a prediction for EACH of the 10,000 rows in the test dataset. Save the results of the two models in a separate CSV files titled ìresults1.csvî and ìresults2.csvî. A result file should each have a single column representing the output from one model.

Step 4:
Submit your work: In addition to the two result files (CSV format), please submit all of your code for cleaning, prepping, and modeling your data (text, html, or PDF preferred), and a brief write-up comparing the pros and cons of the two modeling techniques you used (PDF preferred). Please do not submit the original data back to us. Your work will be scored on techniques used (appropriateness and complexity), model performance - measured by AUC - on the data hold out, an understanding of the two techniques you compared in your write-up, and your overall code.

In [63]:
import numpy as np
import pandas as pd

# models
import sklearn
from sklearn import manifold
from sklearn import naive_bayes
from sklearn import svm
from sklearn import ensemble

# plots
import cufflinks as cf
import plotly
import plotly.io as pio
import plotly.figure_factory as ff
from plotly.offline import init_notebook_mode, iplot
import plotly.graph_objs as go

init_notebook_mode(connected=True)
cf.set_config_file(offline=True, world_readable=True)

# to make this notebook's output identical at every run
np.random.seed(42)

plt_filepath = 'plots/EDA/'
In [65]:
print('numpy version: {}'.format(np.__version__))
print('pandas version: {}'.format(pd.__version__))
print('sklearn version: {}'.format(sklearn.__version__))
print('cufflinks version: {}'.format(cf.__version__))
print('plotly version: {}'.format(plotly.__version__))
numpy version: 1.16.2
pandas version: 0.24.1
sklearn version: 0.20.0
cufflinks version: 0.14.6
plotly version: 3.6.1

Preprocess Data

In [2]:
X = pd.read_csv('data/exercise_01_train.csv')
X.head()
Out[2]:
x0 x1 x2 x3 x4 x5 x6 x7 x8 x9 ... x91 x92 x93 x94 x95 x96 x97 x98 x99 y
0 10.142889 -15.675620 3.583176 -22.397489 27.221894 -34.110924 -0.072829 -0.544444 0.997601 -2.691778 ... 1.916575 5.240820 euorpe 2.431170 0.454074 -18.572032 -14.291524 0.178579 18.110170 0
1 -52.214630 5.847135 -10.902843 -14.132351 20.588574 36.107322 0.115023 0.276093 -0.699168 -0.972708 ... 0.370941 -3.794542 asia 2.592326 31.921833 3.317139 10.037003 -1.930870 -3.486898 0
2 67.718500 2.064334 12.394186 -18.667102 47.465504 -50.373658 0.253707 1.068968 2.939713 2.691218 ... 1.449817 12.470532 asia 7.143821 9.401490 -10.604968 7.643215 -0.842198 -79.358236 0
3 -28.003111 8.565128 -8.592092 5.918960 -3.224154 78.315783 -0.879845 1.176889 -2.414752 0.589646 ... -3.274733 3.484450 asia -4.998195 -20.312810 14.818524 -9.180674 1.356972 14.475681 0
4 80.703016 30.736353 -30.101857 -21.201140 -91.946233 -47.469246 -0.646831 -0.578398 0.980849 -1.426112 ... -0.644261 4.082783 asia -0.012556 -29.334324 1.734433 -12.262072 -0.043228 -19.003881 0

5 rows × 101 columns

In [3]:
"There are {} rows and {} columns in the train set".format(X.shape[0], X.shape[1])
Out[3]:
'There are 40000 rows and 101 columns in the train set'

Class Label

In [4]:
import plotly

plotly.__version__
Out[4]:
'3.6.1'
In [4]:
import plotly

plotly.__version__
Out[4]:
'3.6.1'
In [5]:
y_counts = X.y.value_counts()
data = [go.Bar(
        x=y_counts.index,
        y=y_counts.values  )]

title = 'Y Label: Value Counts'
layout = go.Layout( {'title': title } )

fig = go.Figure(data=data, 
                layout=layout
               )

iplot(fig)
pio.write_image(fig, plt_filepath+title+'.png')

Data

In [6]:
X.dtypes.value_counts()
Out[6]:
float64    94
object      6
int64       1
dtype: int64

Categorical, Object Columns

Let's investigate what the object columns are

In [7]:
objects = X.select_dtypes(include='object')
objects = objects.assign(y=X.y)
objects.head()
Out[7]:
x34 x35 x41 x45 x68 x93 y
0 bmw wed $-54.1 0.0% Jun euorpe 0
1 nissan thur $-229.32 0.01% July asia 0
2 Honda wed $243.68 -0.01% July asia 0
3 Toyota thur $126.15 0.02% May asia 0
4 bmw thurday $877.39 -0.02% July asia 0

Columns x45 and x68 are actually numerical -- let's clean these and add those back to the other numerical columns

In [8]:
X.x41.value_counts()[:5]
Out[8]:
$156.29     4
$-511.36    4
$680.85     4
$-100.95    4
$-370.55    4
Name: x41, dtype: int64
In [9]:
X['x41'] = pd.to_numeric(X.x41.str.replace('$', ''))
In [10]:
X.x45.value_counts()
Out[10]:
0.01%     9546
-0.01%    9545
0.0%      7958
-0.0%     7600
0.02%     2389
-0.02%    2388
-0.03%     284
0.03%      258
-0.04%      15
0.04%       12
Name: x45, dtype: int64
In [11]:
X['x45'] = pd.to_numeric(X.x45.str.replace('%', ''))
In [12]:
objects = objects.drop(['x41', 'x45'], axis=1)
objects.head()
Out[12]:
x34 x35 x68 x93 y
0 bmw wed Jun euorpe 0
1 nissan thur July asia 0
2 Honda wed July asia 0
3 Toyota thur May asia 0
4 bmw thurday July asia 0

For the remaining categorical columns, let's relabel these with more appropriate names

In [13]:
old_labels = ['x34', 'x35', 'x68', 'x93', 'y']
labels = ['car_manufacturer', 'day', 'month', 'market', 'y']
In [14]:
objects.columns = labels
objects.head()
Out[14]:
car_manufacturer day month market y
0 bmw wed Jun euorpe 0
1 nissan thur July asia 0
2 Honda wed July asia 0
3 Toyota thur May asia 0
4 bmw thurday July asia 0

Let's plot the categorical counts against the label to predict

In [15]:
y_index = objects.set_index('y')
y_label_0 = y_index.loc[0].car_manufacturer.value_counts()
y_label_1 = y_index.loc[1].car_manufacturer.value_counts()

# create trace1 
trace1 = go.Bar(
                x = y_label_0.index,
                y = y_label_0.values,
                name = "Label 0")
# create trace2 
trace2 = go.Bar(
                x = y_label_1.index,
                y = y_label_1.values,
                name = "Label 1")

data = [trace1, trace2]

title = "Car Manufacturer Counts by Label"
layout = go.Layout(barmode="group", title=title)
fig = go.Figure(data=data, layout = layout)

iplot(fig)
pio.write_image(fig, plt_filepath+title+'.png')

Now let's explore the value distribution

In [16]:
y_index = objects.set_index('y')
y_label_0 = y_index.loc[0].day.value_counts()
y_label_1 = y_index.loc[1].day.value_counts()

# create trace1 
trace1 = go.Bar(
                x = y_label_0.index,
                y = y_label_0.values,
                name = "Label 0")

# create trace2 
trace2 = go.Bar(
                x = y_label_1.index,
                y = y_label_1.values,
                name = "Label 1")

data = [trace1, trace2]

title = "Day of the Week Counts by Label"
layout = go.Layout(barmode="group", title=title)
fig = go.Figure(data=data, layout = layout)

iplot(fig)
pio.write_image(fig, plt_filepath+title+'.png')
In [17]:
y_index = objects.set_index('y')
y_label_0 = y_index.loc[0].month.value_counts()
y_label_1 = y_index.loc[1].month.value_counts()

# create trace1 
trace1 = go.Bar(
                x = y_label_0.index,
                y = y_label_0.values,
                name = "Label 0")

# create trace2 
trace2 = go.Bar(
                x = y_label_1.index,
                y = y_label_1.values,
                name = "Label 1")

data = [trace1, trace2]

title = "Month Counts by Label"
layout = go.Layout(barmode="group", title=title)
fig = go.Figure(data=data, layout = layout)

iplot(fig)
pio.write_image(fig, plt_filepath+title+'.png')
In [18]:
y_index = objects.set_index('y')
y_label_0 = y_index.loc[0].market.value_counts()
y_label_1 = y_index.loc[1].market.value_counts()

# create trace1 
trace1 = go.Bar(
                x = y_label_0.index,
                y = y_label_0.values,
                name = "Label 0")

# create trace2 
trace2 = go.Bar(
                x = y_label_1.index,
                y = y_label_1.values,
                name = "Label 1")

data = [trace1, trace2]

title = "Market Counts by Label"
layout = go.Layout(barmode="group", title=title)
fig = go.Figure(data=data, layout = layout)

iplot(fig)
pio.write_image(fig, plt_filepath+title+'.png')

Numerical Data

In [19]:
X.dtypes.value_counts()
Out[19]:
float64    96
object      4
int64       1
dtype: int64
In [20]:
nums = X.select_dtypes(exclude='object')
nums.head()
Out[20]:
x0 x1 x2 x3 x4 x5 x6 x7 x8 x9 ... x90 x91 x92 x94 x95 x96 x97 x98 x99 y
0 10.142889 -15.675620 3.583176 -22.397489 27.221894 -34.110924 -0.072829 -0.544444 0.997601 -2.691778 ... -151.134483 1.916575 5.240820 2.431170 0.454074 -18.572032 -14.291524 0.178579 18.110170 0
1 -52.214630 5.847135 -10.902843 -14.132351 20.588574 36.107322 0.115023 0.276093 -0.699168 -0.972708 ... -58.955871 0.370941 -3.794542 2.592326 31.921833 3.317139 10.037003 -1.930870 -3.486898 0
2 67.718500 2.064334 12.394186 -18.667102 47.465504 -50.373658 0.253707 1.068968 2.939713 2.691218 ... -74.014931 1.449817 12.470532 7.143821 9.401490 -10.604968 7.643215 -0.842198 -79.358236 0
3 -28.003111 8.565128 -8.592092 5.918960 -3.224154 78.315783 -0.879845 1.176889 -2.414752 0.589646 ... 165.859181 -3.274733 3.484450 -4.998195 -20.312810 14.818524 -9.180674 1.356972 14.475681 0
4 80.703016 30.736353 -30.101857 -21.201140 -91.946233 -47.469246 -0.646831 -0.578398 0.980849 -1.426112 ... -174.486251 -0.644261 4.082783 -0.012556 -29.334324 1.734433 -12.262072 -0.043228 -19.003881 0

5 rows × 97 columns

In [21]:
nums.describe()
Out[21]:
x0 x1 x2 x3 x4 x5 x6 x7 x8 x9 ... x90 x91 x92 x94 x95 x96 x97 x98 x99 y
count 39986.000000 39990.000000 39994.000000 39990.000000 39994.000000 39990.000000 39993.000000 39987.000000 39994.000000 39993.000000 ... 39993.000000 39998.000000 39994.000000 39989.000000 39993.000000 39985.000000 39993.000000 39995.000000 39987.000000 40000.000000
mean 8.259955 -3.249786 1.030666 -0.747566 0.283820 -1.773510 -0.000232 -0.016107 -0.651093 -0.014688 ... -14.274809 0.011390 0.003948 -0.050510 -0.007572 -0.629241 -1.986671 0.036482 1.486887 0.201175
std 38.374182 15.171131 24.732185 15.225730 42.240018 42.124100 1.065955 3.382644 2.947472 1.906496 ... 154.038206 3.311041 8.763944 4.979690 19.238210 16.915222 14.375663 5.633052 36.926796 0.400884
min -140.780478 -64.493908 -105.388182 -63.804916 -158.195975 -169.237259 -4.133490 -12.966970 -12.037625 -7.446200 ... -674.004008 -12.807938 -38.121111 -21.578977 -87.669573 -77.010252 -57.709983 -23.588876 -154.559512 0.000000
25% -17.800204 -13.458580 -15.565461 -11.078276 -28.246509 -30.391354 -0.723098 -2.299081 -2.628856 -1.299759 ... -116.645845 -2.218739 -5.925508 -3.435180 -12.895717 -11.948902 -11.686033 -3.770599 -23.559519 0.000000
50% 8.354662 -3.386601 1.132995 -0.714888 0.292788 -1.753365 0.001105 -0.003556 -0.659223 -0.028170 ... -11.471306 -0.006726 0.009306 -0.037111 0.124945 -0.481374 -2.026059 0.041838 1.465346 0.000000
75% 33.829780 6.881661 17.677615 9.552404 28.719663 26.844781 0.715844 2.259972 1.322101 1.263469 ... 90.101751 2.238996 5.909011 3.299108 12.988509 10.793171 7.611660 3.840100 26.548474 0.000000
max 177.399176 62.906822 99.394915 59.338352 179.342581 170.894497 5.311653 16.619445 14.994937 7.300186 ... 603.911528 14.982369 35.785334 20.983463 78.785164 70.182932 60.481075 22.759016 143.126382 1.000000

8 rows × 97 columns

In [22]:
nums_subset_plus_y = nums.iloc[:500,:5]
nums_subset_plus_y = nums_subset_plus_y.assign(y=nums.y)
nums_subset_plus_y.dropna(inplace=True)
nums_subset_plus_y.head(1)
Out[22]:
x0 x1 x2 x3 x4 y
0 10.142889 -15.67562 3.583176 -22.397489 27.221894 0
In [23]:
cf.__version__
Out[23]:
'0.14.6'
In [29]:
nums_subset_plus_y.drop('y', axis=1).iplot(title='Line Chart of First 10 Columns')
In [30]:
nums_subset_plus_y.iplot(kind='box', title='Histogram of First 10 Columns')
In [31]:
nums.head()
Out[31]:
x0 x1 x2 x3 x4 x5 x6 x7 x8 x9 ... x90 x91 x92 x94 x95 x96 x97 x98 x99 y
0 10.142889 -15.675620 3.583176 -22.397489 27.221894 -34.110924 -0.072829 -0.544444 0.997601 -2.691778 ... -151.134483 1.916575 5.240820 2.431170 0.454074 -18.572032 -14.291524 0.178579 18.110170 0
1 -52.214630 5.847135 -10.902843 -14.132351 20.588574 36.107322 0.115023 0.276093 -0.699168 -0.972708 ... -58.955871 0.370941 -3.794542 2.592326 31.921833 3.317139 10.037003 -1.930870 -3.486898 0
2 67.718500 2.064334 12.394186 -18.667102 47.465504 -50.373658 0.253707 1.068968 2.939713 2.691218 ... -74.014931 1.449817 12.470532 7.143821 9.401490 -10.604968 7.643215 -0.842198 -79.358236 0
3 -28.003111 8.565128 -8.592092 5.918960 -3.224154 78.315783 -0.879845 1.176889 -2.414752 0.589646 ... 165.859181 -3.274733 3.484450 -4.998195 -20.312810 14.818524 -9.180674 1.356972 14.475681 0
4 80.703016 30.736353 -30.101857 -21.201140 -91.946233 -47.469246 -0.646831 -0.578398 0.980849 -1.426112 ... -174.486251 -0.644261 4.082783 -0.012556 -29.334324 1.734433 -12.262072 -0.043228 -19.003881 0

5 rows × 97 columns

In [32]:
nums.columns.tolist()[:10]
Out[32]:
['x0', 'x1', 'x2', 'x3', 'x4', 'x5', 'x6', 'x7', 'x8', 'x9']
In [33]:
hist_columns = nums.columns.tolist()[:5]
#hist_columns.append('y')
hist_columns
Out[33]:
['x0', 'x1', 'x2', 'x3', 'x4']
In [34]:
nums[hist_columns].iplot(kind='hist')
In [36]:
nums_subset_plus_y.head(100).scatter_matrix() 
In [39]:
# Create distplot with custom bin_size
fig = ff.create_distplot([nums_subset_plus_y[c] for c in nums_subset_plus_y.columns[:3]], nums_subset_plus_y.columns[:3], bin_size=.25)

title = 'Example Distplot with First 3 Columns'
fig['layout'].update(title=title)


iplot(fig)
pio.write_image(fig, plt_filepath+title+'.png')

Clean Data

Pull clean copy of data and reclean for modeling (clean both train and test the same way)

In [40]:
train_df = pd.read_csv('data/exercise_01_train.csv')
test_df = pd.read_csv('data/exercise_01_test.csv')
In [41]:
print("Train Missing")
(train_df.isnull().sum()/train_df.shape[0]).sort_values(ascending=False)[:5] # % are missing data/null
Train Missing
Out[41]:
x96    0.000375
x0     0.000350
x55    0.000350
x18    0.000350
x62    0.000325
dtype: float64
In [42]:
print("Test Missing")
(test_df.isnull().sum()/test_df.shape[0]).sort_values(ascending=False)[:5] # % are missing data/null
Test Missing
Out[42]:
x73    0.0007
x97    0.0006
x24    0.0005
x76    0.0004
x13    0.0004
dtype: float64
In [43]:
def handle_missing_data(df, drop=False, fill=False, impute=False):
    if drop:
        return df.dropna()
    if fill:
        return df.fillna(df.mean())
    return df
    # add impute instructions
In [44]:
def relabel_data(df):
    df['x41'] = pd.to_numeric(df.x41.str.replace('$', ''))
    df['x45'] = pd.to_numeric(df.x45.str.replace('%', ''))
    df['x34'] = df.x34.str.lower()
    df['x35'] = df.x35.str.replace('wed', 'wednesday')
    df['x35'] = df.x35.str.replace('thur', 'thursday')
    df['x35'] = df.x35.str.replace('fri', 'friday')
    df['x68'] = df.x68.str.lower()
    df['x68'] = df.x68.str.replace('jun', 'june')
    df['x68'] = df.x68.str.replace('aug', 'august')
    df['x68'] = df.x68.str.replace('sept.', 'september')
    df['x68'] = df.x68.str.replace('apr', 'april')
    df['x68'] = df.x68.str.replace('oct', 'october')
    df['x68'] = df.x68.str.replace('mar', 'march')
    df['x68'] = df.x68.str.replace('nov', 'november')
    df['x68'] = df.x68.str.replace('feb', 'february')
    df['x68'] = df.x68.str.replace('dev', 'december') # guessing dev = december
    df['x93'] = df.x93.str.replace('euorpe', 'europe')
    return df
In [45]:
def encode_categoricals(objects, prefix=['cars', 'day', 'month', 'market']):
    objects = pd.get_dummies(objects, prefix, dummy_na=True)
    return objects
In [46]:
train_df = relabel_data(train_df)

objects = train_df.select_dtypes(include='object')
train_df = train_df.drop(objects.columns.tolist(), axis=1)

objects = encode_categoricals(objects)
train_df = pd.concat([train_df, objects], axis=1)

train_df = handle_missing_data(train_df, drop=True)
In [47]:
test_df = relabel_data(test_df)

objects = test_df.select_dtypes(include='object')
test_df = test_df.drop(objects.columns.tolist(), axis=1)

objects = encode_categoricals(objects)
test_df = pd.concat([test_df, objects], axis=1)

test_df = handle_missing_data(test_df, fill=True)
In [48]:
print("Train Missing")
(train_df.isnull().sum()/train_df.shape[0]).sort_values(ascending=False)[:5] # % are missing data/null
Train Missing
Out[48]:
market_nan    0.0
x51           0.0
x37           0.0
x38           0.0
x39           0.0
dtype: float64
In [49]:
train_df.shape
Out[49]:
(39232, 134)
In [50]:
print("Test Missing")
(test_df.isnull().sum()/test_df.shape[0]).sort_values(ascending=False)[:5] # % are missing data/null
Test Missing
Out[50]:
market_nan    0.0
x50           0.0
x36           0.0
x37           0.0
x38           0.0
dtype: float64
In [51]:
test_df.shape
Out[51]:
(10000, 133)
In [52]:
train_df.dtypes.value_counts()
Out[52]:
float64    96
uint8      37
int64       1
dtype: int64

The int64 type refers to the y label

In [53]:
test_df.dtypes.value_counts()
Out[53]:
float64    96
uint8      37
dtype: int64
In [54]:
y = train_df.pop('y')
In [55]:
train_df.to_csv('data/train.csv', index=False)
test_df.to_csv('data/test.csv', index=False)
In [56]:
y.to_csv('data/train_y.csv', index=False)
/Users/carrie/anaconda/envs/mlbook/lib/python3.5/site-packages/ipykernel/__main__.py:1: FutureWarning:

The signature of `Series.to_csv` was aligned to that of `DataFrame.to_csv`, and argument 'header' will change its default value from False to True: please pass an explicit value to suppress this warning.

Correlated Features

Correlation Does Not Equal Causation!

In [57]:
train_df_with_label = train_df.assign(y=y)
In [58]:
corr = train_df_with_label.corr()
In [59]:
corr[~corr.index.isin(['y'])].y.sort_values().iplot(kind='bar', title='Correlation w/Y Label')
In [60]:
corr.iplot(kind='heatmap', colorscale='spectral')

Save cleaned data